Analysis and visualization of biological data

Nuanced data manipulation and visualization; creating report files

In this part we’ll use some pre-loaded data tables already present in R, then utilize the R-packages tidyr, dplyr, and ggplot2, which are part of the Tidyverse, which is a set of R-packages intended for data science applications, with the same underlying design philosophy, grammar, and data structures.

We’ll go through some basic data-overview methods. Then some of the more important functions that help us to handle our data efficiently. Next, we’ll get acquainted with how we can produce nice figures that best visualize our data. Finally, we’ll put all these into human-readable, relatively aesthetic report files.

Throughout this session, we’ll use the below data tables:

  • Theoph: pharmacokinetics of theophylline
  • esoph: data on the incidence of esophagial cancer in association with alcohol and tobacco consumption
  • iris: sepal length and width, and petal length and width, for 50 flowers from each of 3 species of iris: Iris setosa, I. versicolor, and I. virginica
  • Indometh: pharmacokinetics of indometacin

Further information can be obtained about these by using the help function:

# with the "help()" function:
help(esoph)

# or alternatively:
?esoph

We’ll use the following R-packages:

# main
require(magrittr)
require(tidyr)
require(dplyr)
require(ggplot2)
require(patchwork)

require(knitr)
require(kableExtra)

require(DT)

# optional
require(GGally)
require(ggcorrplot)
require(ggpubr)

1 Handling data

1.1 Overview of data

Data are, in their basic forms, data frames. To get more information about our data even when we’re just using simple functions such as “head()”, we can convert them into “tibble” class. (This is not crucially important: data frames are just fine, but for now, let’s be fancy and use the data frames as tibbles.)

Theoph = as_tibble(Theoph)
esoph = as_tibble(esoph)
iris = as_tibble(iris)

# Theoph
summary(Theoph)
##     Subject         Wt             Dose            Time             conc       
##  6      :11   Min.   :54.60   Min.   :3.100   Min.   : 0.000   Min.   : 0.000  
##  7      :11   1st Qu.:63.58   1st Qu.:4.305   1st Qu.: 0.595   1st Qu.: 2.877  
##  8      :11   Median :70.50   Median :4.530   Median : 3.530   Median : 5.275  
##  11     :11   Mean   :69.58   Mean   :4.626   Mean   : 5.895   Mean   : 4.960  
##  3      :11   3rd Qu.:74.42   3rd Qu.:5.037   3rd Qu.: 9.000   3rd Qu.: 7.140  
##  2      :11   Max.   :86.40   Max.   :5.860   Max.   :24.650   Max.   :11.400  
##  (Other):66
str(Theoph)
## tibble [132 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Subject: Ord.factor w/ 12 levels "6"<"7"<"8"<"11"<..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ Wt     : num [1:132] 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 ...
##  $ Dose   : num [1:132] 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 ...
##  $ Time   : num [1:132] 0 0.25 0.57 1.12 2.02 ...
##  $ conc   : num [1:132] 0.74 2.84 6.57 10.5 9.66 8.58 8.36 7.47 6.89 5.94 ...
##  - attr(*, "formula")=Class 'formula'  language conc ~ Time | Subject
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Time since drug administration"
##   ..$ y: chr "Theophylline concentration in serum"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(hr)"
##   ..$ y: chr "(mg/l)"
head(Theoph)
## # A tibble: 6 × 5
##   Subject    Wt  Dose  Time  conc
##   <ord>   <dbl> <dbl> <dbl> <dbl>
## 1 1        79.6  4.02  0     0.74
## 2 1        79.6  4.02  0.25  2.84
## 3 1        79.6  4.02  0.57  6.57
## 4 1        79.6  4.02  1.12 10.5 
## 5 1        79.6  4.02  2.02  9.66
## 6 1        79.6  4.02  3.82  8.58
# esoph
summary(esoph)
##    agegp          alcgp         tobgp        ncases         ncontrols     
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 0.000  
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 1.000  
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 4.000  
##  55-64:16   120+     :21   30+     :20   Mean   : 2.273   Mean   : 8.807  
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:10.000  
##  75+  :11                                Max.   :17.000   Max.   :60.000
str(esoph)
## tibble [88 × 5] (S3: tbl_df/tbl/data.frame)
##  $ agegp    : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alcgp    : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ tobgp    : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ ncases   : num [1:88] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ncontrols: num [1:88] 40 10 6 5 27 7 4 7 2 1 ...
head(esoph)
## # A tibble: 6 × 5
##   agegp alcgp     tobgp    ncases ncontrols
##   <ord> <ord>     <ord>     <dbl>     <dbl>
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day 10-19         0        10
## 3 25-34 0-39g/day 20-29         0         6
## 4 25-34 0-39g/day 30+           0         5
## 5 25-34 40-79     0-9g/day      0        27
## 6 25-34 40-79     10-19         0         7
# iris
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
str(iris)
## tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa

1.2 Manipulating data

1.2.1 pipe

%>%: “Pipe an object forward into a function or call expression.”

iris %>% head
## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa

1.2.2 select

?select: “Select (and optionally rename) variables in a data frame, using a concise mini-language that makes it easy to refer to variables based on their name (e.g. a:f selects all columns from a on the left to f on the right) or type (e.g. where(is.numeric) selects all numeric columns).”

# look at the first few rows of selected variables with "head()"
head(select(Theoph, Time, conc))
## # A tibble: 6 × 2
##    Time  conc
##   <dbl> <dbl>
## 1  0     0.74
## 2  0.25  2.84
## 3  0.57  6.57
## 4  1.12 10.5 
## 5  2.02  9.66
## 6  3.82  8.58
# alternatively, with pipe:
Theoph %>% select(Time, conc) %>% head
## # A tibble: 6 × 2
##    Time  conc
##   <dbl> <dbl>
## 1  0     0.74
## 2  0.25  2.84
## 3  0.57  6.57
## 4  1.12 10.5 
## 5  2.02  9.66
## 6  3.82  8.58
# or, in a more "super-fancy-professional" way:
Theoph %>%
  select(Time, conc) %>% 
  head
## # A tibble: 6 × 2
##    Time  conc
##   <dbl> <dbl>
## 1  0     0.74
## 2  0.25  2.84
## 3  0.57  6.57
## 4  1.12 10.5 
## 5  2.02  9.66
## 6  3.82  8.58

Note that this is equivalent to:

head(Theoph[,c("Time", "conc")])
## # A tibble: 6 × 2
##    Time  conc
##   <dbl> <dbl>
## 1  0     0.74
## 2  0.25  2.84
## 3  0.57  6.57
## 4  1.12 10.5 
## 5  2.02  9.66
## 6  3.82  8.58
# or
Theoph[,c("Time", "conc")] %>%
  head
## # A tibble: 6 × 2
##    Time  conc
##   <dbl> <dbl>
## 1  0     0.74
## 2  0.25  2.84
## 3  0.57  6.57
## 4  1.12 10.5 
## 5  2.02  9.66
## 6  3.82  8.58

Exclude specific variables (columns):

Theoph %>%
  select(-Subject) %>%
  head
## # A tibble: 6 × 4
##      Wt  Dose  Time  conc
##   <dbl> <dbl> <dbl> <dbl>
## 1  79.6  4.02  0     0.74
## 2  79.6  4.02  0.25  2.84
## 3  79.6  4.02  0.57  6.57
## 4  79.6  4.02  1.12 10.5 
## 5  79.6  4.02  2.02  9.66
## 6  79.6  4.02  3.82  8.58

1.2.3 filter

?filter: “The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [.”

Theoph %>%
  filter(Subject == 2) %>%
  head
## # A tibble: 6 × 5
##   Subject    Wt  Dose  Time  conc
##   <ord>   <dbl> <dbl> <dbl> <dbl>
## 1 2        72.4   4.4  0     0   
## 2 2        72.4   4.4  0.27  1.72
## 3 2        72.4   4.4  0.52  7.91
## 4 2        72.4   4.4  1     8.31
## 5 2        72.4   4.4  1.92  8.33
## 6 2        72.4   4.4  3.5   6.85
Theoph %>%
  filter(Subject != 2, !is.na(Time)) %>%
  head
## # A tibble: 6 × 5
##   Subject    Wt  Dose  Time  conc
##   <ord>   <dbl> <dbl> <dbl> <dbl>
## 1 1        79.6  4.02  0     0.74
## 2 1        79.6  4.02  0.25  2.84
## 3 1        79.6  4.02  0.57  6.57
## 4 1        79.6  4.02  1.12 10.5 
## 5 1        79.6  4.02  2.02  9.66
## 6 1        79.6  4.02  3.82  8.58

1.2.4 mutate

?mutate: “mutate() creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL).”

Theoph %>%
  mutate(mins = Time*60 ) %>%
  head
## # A tibble: 6 × 6
##   Subject    Wt  Dose  Time  conc  mins
##   <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1        79.6  4.02  0     0.74   0  
## 2 1        79.6  4.02  0.25  2.84  15  
## 3 1        79.6  4.02  0.57  6.57  34.2
## 4 1        79.6  4.02  1.12 10.5   67.2
## 5 1        79.6  4.02  2.02  9.66 121. 
## 6 1        79.6  4.02  3.82  8.58 229.
esoph %>%
  mutate(case.rate = ncases/(ncases+ncontrols) ) %>%
  head
## # A tibble: 6 × 6
##   agegp alcgp     tobgp    ncases ncontrols case.rate
##   <ord> <ord>     <ord>     <dbl>     <dbl>     <dbl>
## 1 25-34 0-39g/day 0-9g/day      0        40         0
## 2 25-34 0-39g/day 10-19         0        10         0
## 3 25-34 0-39g/day 20-29         0         6         0
## 4 25-34 0-39g/day 30+           0         5         0
## 5 25-34 40-79     0-9g/day      0        27         0
## 6 25-34 40-79     10-19         0         7         0

1.2.5 group_by & summarize

esoph %>%
  group_by(agegp) %>%
  summarize(mean(ncases))
## # A tibble: 6 × 2
##   agegp `mean(ncases)`
##   <ord>          <dbl>
## 1 25-34         0.0667
## 2 35-44         0.6   
## 3 45-54         2.88  
## 4 55-64         4.75  
## 5 65-74         3.67  
## 6 75+           1.18
esoph %>%
  group_by(alcgp, tobgp) %>%
  summarize(mean(ncases))
## `summarise()` has grouped output by 'alcgp'. You can override using the
## `.groups` argument.
## # A tibble: 16 × 3
## # Groups:   alcgp [4]
##    alcgp     tobgp    `mean(ncases)`
##    <ord>     <ord>             <dbl>
##  1 0-39g/day 0-9g/day          1.5  
##  2 0-39g/day 10-19             1.67 
##  3 0-39g/day 20-29             1    
##  4 0-39g/day 30+               0.833
##  5 40-79     0-9g/day          5.67 
##  6 40-79     10-19             2.83 
##  7 40-79     20-29             2.5  
##  8 40-79     30+               1.8  
##  9 80-119    0-9g/day          3.17 
## 10 80-119    10-19             3.17 
## 11 80-119    20-29             1.5  
## 12 80-119    30+               1.4  
## 13 120+      0-9g/day          2.67 
## 14 120+      10-19             2    
## 15 120+      20-29             1.4  
## 16 120+      30+               2.5

1.2.6 arrange

?arrange: “arrange() orders the rows of a data frame by the values of selected columns. Unlike other dplyr verbs, arrange() largely ignores grouping; you need to explicitly mention grouping variables (or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group.”

# arranging in increasing order
esoph %>%
  group_by(agegp, tobgp) %>%
  summarize(mean.cases=mean(ncases)) %>%
  arrange(-mean.cases)
## `summarise()` has grouped output by 'agegp'. You can override using the
## `.groups` argument.
## # A tibble: 24 × 3
## # Groups:   agegp [6]
##    agegp tobgp    mean.cases
##    <ord> <ord>         <dbl>
##  1 65-74 0-9g/day       7.75
##  2 55-64 0-9g/day       6.25
##  3 55-64 10-19          5.75
##  4 55-64 30+            4   
##  5 45-54 0-9g/day       3.5 
##  6 45-54 10-19          3.25
##  7 55-64 20-29          3   
##  8 65-74 10-19          3   
##  9 45-54 30+            2.75
## 10 65-74 20-29          2.5 
## # … with 14 more rows
# arranging in decreasing order
esoph %>%
  group_by(alcgp, tobgp) %>%
  summarize(mean.cases=mean(ncases)) %>%
  arrange(-mean.cases)
## `summarise()` has grouped output by 'alcgp'. You can override using the
## `.groups` argument.
## # A tibble: 16 × 3
## # Groups:   alcgp [4]
##    alcgp     tobgp    mean.cases
##    <ord>     <ord>         <dbl>
##  1 40-79     0-9g/day      5.67 
##  2 80-119    0-9g/day      3.17 
##  3 80-119    10-19         3.17 
##  4 40-79     10-19         2.83 
##  5 120+      0-9g/day      2.67 
##  6 40-79     20-29         2.5  
##  7 120+      30+           2.5  
##  8 120+      10-19         2    
##  9 40-79     30+           1.8  
## 10 0-39g/day 10-19         1.67 
## 11 0-39g/day 0-9g/day      1.5  
## 12 80-119    20-29         1.5  
## 13 80-119    30+           1.4  
## 14 120+      20-29         1.4  
## 15 0-39g/day 20-29         1    
## 16 0-39g/day 30+           0.833

1.2.7 count

Simple counting of levels of categorical variables:

iris %>%
  count(Species)
## # A tibble: 3 × 2
##   Species        n
##   <fct>      <int>
## 1 setosa        50
## 2 versicolor    50
## 3 virginica     50
# alternatively:
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
# or
xtabs(~Species, data = iris)
## Species
##     setosa versicolor  virginica 
##         50         50         50

Cross tabulation (a.k.a. contingency table, i.e. checking the number of observations corresponding to levels from two variables):

esoph %>%
  count(agegp, alcgp)
## # A tibble: 24 × 3
##    agegp alcgp         n
##    <ord> <ord>     <int>
##  1 25-34 0-39g/day     4
##  2 25-34 40-79         4
##  3 25-34 80-119        3
##  4 25-34 120+          4
##  5 35-44 0-39g/day     4
##  6 35-44 40-79         4
##  7 35-44 80-119        4
##  8 35-44 120+          3
##  9 45-54 0-39g/day     4
## 10 45-54 40-79         4
## # … with 14 more rows
# alternatively:
table(esoph$agegp, esoph$alcgp)
##        
##         0-39g/day 40-79 80-119 120+
##   25-34         4     4      3    4
##   35-44         4     4      4    3
##   45-54         4     4      4    4
##   55-64         4     4      4    4
##   65-74         4     3      4    4
##   75+           3     4      2    2
# or
xtabs(~agegp+alcgp, data = esoph)
##        alcgp
## agegp   0-39g/day 40-79 80-119 120+
##   25-34         4     4      3    4
##   35-44         4     4      4    3
##   45-54         4     4      4    4
##   55-64         4     4      4    4
##   65-74         4     3      4    4
##   75+           3     4      2    2

1.2.8 reshaping data frames

In general, data tables can either be in a “long” or a “wide” format, which refers to how observations are recorded in the table. In the long format the repeated measurements are in separate rows and the corresponding name is coded as another variable; in the wide format, repeated measurements are in separate vairables (columns) but in the same row. Some analysis or visualization methods require specific data structures, for which it often may be necessary to reshape our data frames.

table formats

Wide to long:

head(iris)
## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa
iris.long = iris %>%
  pivot_longer(
    cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), 
    names_to = "flower.part", values_to = "value")
head(iris.long)
## # A tibble: 6 × 3
##   Species flower.part  value
##   <fct>   <chr>        <dbl>
## 1 setosa  Sepal.Length   5.1
## 2 setosa  Sepal.Width    3.5
## 3 setosa  Petal.Length   1.4
## 4 setosa  Petal.Width    0.2
## 5 setosa  Sepal.Length   4.9
## 6 setosa  Sepal.Width    3
# alternatively
iris.long = iris %>%
  pivot_longer(
    cols = c(Sepal.Length:Petal.Width), 
    names_to = "flower.part", values_to = "value")
head(iris.long)
## # A tibble: 6 × 3
##   Species flower.part  value
##   <fct>   <chr>        <dbl>
## 1 setosa  Sepal.Length   5.1
## 2 setosa  Sepal.Width    3.5
## 3 setosa  Petal.Length   1.4
## 4 setosa  Petal.Width    0.2
## 5 setosa  Sepal.Length   4.9
## 6 setosa  Sepal.Width    3

Long to wide:

Indometh %>%
  pivot_wider(
    names_from = time, 
    values_from = conc, names_prefix = "timepoint_")
## # A tibble: 6 × 12
##   Subject timepoint_0.25 timep…¹ timep…² timep…³ timep…⁴ timep…⁵ timep…⁶ timep…⁷
##   <ord>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1                 1.5     0.94    0.78    0.48    0.37    0.19    0.12    0.11
## 2 2                 2.03    1.63    0.71    0.7     0.64    0.36    0.32    0.2 
## 3 3                 2.72    1.49    1.16    0.8     0.8     0.39    0.22    0.12
## 4 4                 1.85    1.39    1.02    0.89    0.59    0.4     0.16    0.11
## 5 5                 2.05    1.04    0.81    0.39    0.3     0.23    0.13    0.11
## 6 6                 2.31    1.44    1.03    0.84    0.64    0.42    0.24    0.17
## # … with 3 more variables: timepoint_5 <dbl>, timepoint_6 <dbl>,
## #   timepoint_8 <dbl>, and abbreviated variable names ¹​timepoint_0.5,
## #   ²​timepoint_0.75, ³​timepoint_1, ⁴​timepoint_1.25, ⁵​timepoint_2, ⁶​timepoint_3,
## #   ⁷​timepoint_4
# first reshape iris into long format, but now with a row-ID
iris.2 = iris
iris.2$id = 1:nrow(iris.2)
iris.long = iris.2 %>%
  pivot_longer(
    cols = c(Sepal.Length:Petal.Width), 
    names_to = "flower.part", values_to = "value")

# long to wide
iris.long %>%
  pivot_wider(
    names_from = flower.part, 
    values_from = value, id_cols = c(id, Species)) %>%
  head
## # A tibble: 6 × 6
##      id Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <int> <fct>          <dbl>       <dbl>        <dbl>       <dbl>
## 1     1 setosa           5.1         3.5          1.4         0.2
## 2     2 setosa           4.9         3            1.4         0.2
## 3     3 setosa           4.7         3.2          1.3         0.2
## 4     4 setosa           4.6         3.1          1.5         0.2
## 5     5 setosa           5           3.6          1.4         0.2
## 6     6 setosa           5.4         3.9          1.7         0.4

1.3 Showcasing tables

1.3.1 knitr & kableExtra

esoph %>%
  head() %>%
  knitr::kable() %>%
  kable_styling() %>%
  print()
agegp alcgp tobgp ncases ncontrols
25-34 0-39g/day 0-9g/day 0 40
25-34 0-39g/day 10-19 0 10
25-34 0-39g/day 20-29 0 6
25-34 0-39g/day 30+ 0 5
25-34 40-79 0-9g/day 0 27
25-34 40-79 10-19 0 7
header_vec = c(1,4,1) # +1 because we also display rownames
names(header_vec) = c(" ", "Flower parts", " ")
iris %>%
  head() %>%
  knitr::kable(row.names=T, align = "c") %>%
  kable_styling(full_width = F) %>%
  add_header_above(header_vec) %>%
  print()
Flower parts
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

1.3.2 DT

iris %>%
  datatable(options = list(pageLength = 10))

2 Visualizing data

There is a huge number of ways by which data visualization can be carried out, even within R: from the most basic functions such as “plot()”, “boxplot()”, and “hist()”, to very complex plotting packages, like plotly. One of the most flexible, robust, and widely used among the graphical R-packages is ggplot2 (and its numerous complementing packages to improve and expand its functionality).

“ggplot2 has an underlying grammar, based on the Grammar of Graphics (Wilkinson, 2005), that allows you to compose graphs by combining independent components. This makes ggplot2 powerful. Rather than being limited to sets of pre-defined graphics, you can create novel graphics that are tailored to your specific problem.”

The main idea is to provide layers of visualization and the grammar is following the “logic of layers”, as each component or modification to the figure we’d like to create comes from flexible but specialized functions. All ggplot2 figures are built from two components: the data and a mapping. For the latter, additional elements include the layer (with geometric and statistical information), scale, coordinate system, faceting, and theme.

2.1 Basic plots

# scatterplot

p1 = ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
p1

# boxplot
p2 = ggplot(iris, aes(x = Species, y = Sepal.Width)) +
  geom_boxplot()
p2

# barplot
df.1 = as.data.frame(table(esoph$agegp))
p3 = ggplot(df.1, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity")
p3

# or, alternatively for the above barplot:
p3.2 = ggplot(esoph, aes(x = agegp, y = after_stat(count))) +
  geom_bar()
p3.2

# histogram
p4 = ggplot(Theoph, aes(x = conc)) +
  geom_histogram(bins = 50)
p4

# density plot
p4.2 = ggplot(Theoph, aes(x = conc)) +
  geom_density()
p4.2

2.2 Shaping plots

There is a large number of pre-defined colors in R: R color chart. Also, on scatter plots and line plots the shape of visualzing elements can be altered as well, using pre-defined plot symbol shapes and line types.

p5 = ggplot(esoph, aes(ncases)) +
  geom_bar(aes(fill=agegp))
p5

p6 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species)) +
  geom_point(size = 3, alpha = 0.5) +
  stat_ellipse(type="t") +
  scale_color_manual(values = c(setosa = "cyan", versicolor = "red", virginica = "green")) +
  scale_shape_manual(values = c(setosa = 15, versicolor = 16, virginica = 17)) +
  xlab("Sepal length") +
  ylab("Sepal width") +
  ggtitle("iris data") + 
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p6

# example using the long format of iris:
p6.2 = ggplot(iris.long, aes(value, fill=Species)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~flower.part, scales = "free") +
  theme_bw()
p6.2

p7 = ggplot(Theoph, aes(x = Time, y = conc, color = Subject, linetype = Subject)) +
  geom_line() +
  geom_point() +
  theme_bw()
p7

p8 = ggplot(iris, aes(x = Species, y = Sepal.Width, color = Species)) +
  geom_boxplot() +
  geom_jitter(size = 3, alpha = 0.9) +
  xlab("spp.") +
  ylab("Sepal width") +
  coord_flip() +
  theme_bw()
p8

# with denoting of significant group differences, using the R-package "ggpubr"
# also note that in this case, not "color" is used in "aes()", but "fill"!
# (Note: "stat_compare_means" uses simple t-test by default, so should only be used 
# on data with normal distribution.)
## "method" can be either one of these:
## t.test, wilcox.test, kruskal.test, anova
comparison.list = list(
  c("setosa", "versicolor"),
  c("setosa", "virginica"),
  c("versicolor", "virginica")
)
p8.2 = ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot(alpha = 0.5) +
  xlab("spp.") +
  ylab("Sepal width") +
  theme_bw() +
  stat_compare_means(
    method = "t.test",
    comparisons = comparison.list,
    label = "p.signif",
    symnum.args = list(cutpoints = c(0, 0.001, 0.01, 0.05, 1), 
         symbols = c("***", "**", "*", "ns"))
  )
p8.2

# alternatively, with P-values:
p8.3 = ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot(alpha = 0.5) +
  xlab("spp.") +
  ylab("Sepal width") +
  theme_bw() +
  stat_compare_means(
    comparisons = comparison.list,
    label = "p.format"
  )
p8.3

p9 = ggplot(iris, aes(x = Species, y = Sepal.Width, color = Species)) +
  geom_violin(draw_quantiles = c(0.5), fill = NA) +
  geom_jitter(size = 3, alpha = 0.5) +
  xlab("spp.") +
  ylab("Sepal width") +
  coord_flip() +
  theme_bw()
p9

2.3 Facets by groups

p7.2 = ggplot(Theoph, aes(x = Time, y = conc, color = Subject, linetype = Subject)) +
  geom_line() +
  geom_point() +
  facet_wrap(~Dose) +
  theme_bw()
p7.2

esoph = esoph %>%
  mutate(rate = ncases/sum(ncases+ncontrols))
esoph$line.group = paste(esoph$tobgp, esoph$alcgp, sep="__")
p7.3 = ggplot(esoph, aes(x = agegp, y = rate, color = agegp)) +
  geom_point(size = 3) +
  facet_grid(tobgp~alcgp) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_line(
    inherit.aes = F, data = esoph, 
    mapping = aes(x = agegp, y = rate, color = agegp, group = line.group))
p7.3

2.4 geom_smooth

p10 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth(method = "lm", formula = y~x, se = T) +
  xlab("Sepal length") +
  ylab("Sepal width") +
  theme_bw()
p10

p11 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth(method = "loess", formula = y~x, se = F) +
  xlab("Sepal length") +
  ylab("Sepal width") +
  theme_bw()
p11

p12 = ggplot(Theoph, aes(x = Time, y = conc, color = Subject)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth(data = Theoph, mapping = aes(x = Time, y = conc), 
              formula = y~x, method = "loess", inherit.aes = F) +
  theme_bw()
p12

2.5 Some more…

require(ggalt)
p6.2 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_encircle() +
  scale_shape_manual(values = c(setosa = 15, versicolor = 16, virginica = 17)) +
  xlab("Sepal length") +
  ylab("Sepal width") +
  ggtitle("iris data") + 
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p6.2

require(GGally)
ggpairs(iris, mapping = aes(color = Species, alpha=0.5)) +
  theme_bw()

require(Hmisc)
Cmat = rcorr(as.matrix(iris[,1:4]))
Cmat$P[is.na(Cmat$P)] = 0
ggcorrplot(
    corr = Cmat$r, p.mat = Cmat$P, 
    method = "circle", hc.order = TRUE, 
    sig.level = 0.05) + 
  theme(plot.title = element_text(hjust = 0.5))

require(ggExtra)
p13 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth(method = "lm", formula = y~x, se = T) +
  xlab("Sepal length") +
  ylab("Sepal width") +
  theme_bw()

ggMarginal(p13, type = "density", groupColour = T, fill = "transparent")

ggMarginal(p13, type = "boxplot", groupColour = T)

2.6 Arranging multiple plots

A useful package, which can help to create publication-ready composite figures: patchwork.

require(patchwork)

p1 + p2

(p10 | p9) / p4

p5 = p5 + theme_bw()
p7 = p7 + guides(color="none", linetype="none")
p12 = p12 + guides(color="none")
p7.2 / (p5 + p7 + p12)

2.7 Export plots

# save latest printed plot
ggsave(
  filename = "plot_p7.png",
  path = "path/to/folder",
  device = "png", 
  dpi = 500, 
  width = 5, 
  height = 5, 
  units = "in"
)

# save one specific plot
ggsave(
  plot = p7,
  filename = "plot_p7.png",
  path = "path/to/folder",
  device = "png", 
  dpi = 500, 
  width = 5, 
  height = 5, 
  units = "in"
)

3 Practice

The practicing task will be simply to use the iris data table, and use some of the data manipulation methods from above, then create a simple docx report file.

  1. Print out the mean values of Sepal.Length separately for each species from iris, using the group_by() and summarize() functions!
  2. Using the presented techniques, create a data table named iris.r1 in which only versicolor species are retained!
  3. In this new iris.r1 create a new variable called Petal.mean, which contains the mean of Petal.Length and Petal.Width!
  4. Arrange the observations within the iris.r1 by sepal length, and print the first 6 rows!
  5. Create the long format version of iris.r1 and save it in a new variable as iris.r1.long!
  6. Create a html report file, in which:
    • code chunks are not shown;
    • warnings and messages are not shown from the code chunks;
    • there is a header with your name;
    • there is a simple text block with the Description from the help of the iris data table (which help can be accessed with ?iris);
    • the head of the iris.r1.long table is printed out, using the kintr and kableExtra packages;
    • there is a boxplot enhanced with jitter (geom_jitter) to visualize flower part values from iris.r1.long, in a way that (i) flower parts are shown on the x-axis (horizontal axis) and measurement values are shown on the y-axis (vertical axis), and (ii) coloration is done based on flower parts!
  7. Create a docx report, which only includes the header, the description text, and the boxplot!